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Summary 

Many public and private genome-wide association studies that we have analyzed include flaws in de- 
sign, with avoidable confounding appearing as a norm rather than the exception. Rather than recognizing 
flawed research design and addressing that, a category of quality-control statistical methods has arisen to 
treat only the symptoms. Reflecting more deeply, we examine elements of current genomic research in 
light of the traditional scientific method and find that hypotheses are often detached from data collection, 
experimental design, and causal theories. Association studies independent of causal theories, along with 
multiple testing errors, too often drive health care and public policy decisions. In an era of large-scale 
biological research, we ask questions about the role of statistical analyses in advancing coherent theories 
of diseases and their mechanisms. We advocate for reinterpretation of the scientific method in the context 
of large-scale data analysis opportunities and for renewed appreciation of falsifiable hypotheses, so that 
we can learn more from our best mistakes. 
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1. Introduction 

Often deeper investigation into what appears as an isolated occurrence leads to appreciation of a broader 
problem. Our starting point for this discussion was observing major batch effects in all but 2 of 30 public 
and private genome-wide association studies (GWAS) that the first author analyzed over the past several 
years. Problematically, the dependent variable, case-control status, was correlated with the order of sam- 
ple collection and/or the order in which samples were batch-processed on genotyping instruments. An 
example of this can be found in the seminal Wellcome Trust GWAS Study (Wellcome Trust Case Control 
Consortium [WTCCC], 2007). Members of the consortium independently collected and extracted case and 
control DNA at different sites; then control samples were genotyped on a series of 96-well plates, while 
case samples were genotyped on another series of plates, sometimes including more than one disease 
on a given plate — but cases and controls were not block randomized. Because experimental conditions 
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vary over time and site, with the unit of the plate providing the largest source of variability from one run 
to the next, the confounding of case-control status with experimental order was severe. Lambert (2010) 
contrasted "unfiltered" single nucleotide polymorphism (SNP) and copy number variant (CNV) Manhat- 
tan plots for the WTCCC study and a block-randomized Alzheimer's GWAS, finding the former riddled 
with false positives significant at 10" 20 (SNP GWAS) and 10" 200 (CNV GWAS) levels and the latter 
containing no such artifacts. Early mistakes like this are expected. Our concern is that we detect and learn 
from them. Are we poised to make similar mistakes with high-throughput sequencing, where hundreds or 
thousands of spurious associations may lead investigators to questionable conclusions? We cannot answer 
honestly without examining the philosophy and premises that underlie our experimental methods and the 
culture in which we execute them. 

Our GWAS concerns promise relevance for observational studies in general, which remain pivotal 
to biological sciences. Here we first describe recurring symptoms of a problem emerging in some pub- 
lished GWAS. Next we recall elements of the traditional scientific method to raise questions about how 
association studies of large observational data sets can advance biological understanding, particularly by 
embracing and exploring falsifiable hypotheses. Rather than offer answers, we seek to foster substantive 
conversations, explorations, and yes, more questions about how our field can persevere in the highest- 
quality research design and implementation that genomic and similar molecular research can provide. The 
public, medical research practitioners, funding organizations, and we ourselves need these discussions to 
achieve society's shared goals for health and well-being. 

2. Symptoms of a broader problem 

A common symptom of the problems we investigated is confounding of the outcome of interest with 
experimental order. One of us raised the issue in a blog post (Lambert, 2010), and Leek and others (2010) 
further elucidated the problem. Alarmingly, confounding can occur many times over by, for example, 
collecting DNA for cases from sites distinct from those providing DNA for controls, running the cases and 
controls on different genotyping platforms and on separate sets of plates, and then calculating statistical 
associations between case-control statuses on these questionably measured genotypes. As an illustration, 
we analyzed a trio study genotyped by the Broad Institute in which plating placed the fathers' DNA 
samples on one set of plates, the mothers' samples on another, and the children's samples on a third set. 
As a result, real genetic differences among family members became confounded with errors in genotyping 
from plate to plate, making results problematic for publication. We have also seen instances in which 
DNA is extracted from blood in controls and buccal or saliva cells in disease cases, and then hypotheses 
comparing cases and controls are confounded with performance characteristics of genotyping on these 
different DNA sources. 

If this level of confounding seems a thing of the past, one need to look no further than the recent 
Science paper "Genetic signature of exceptional longevity in humans" (Sebastiani and others, 2010), 
which underwent editorial reevaluation (Alberts, 2010) and subsequent retraction (Sebastiani and others, 
201 1). Interestingly, the editorial concern and subsequent retraction cite a lack of quality control and geno- 
typing errors peculiar to a genotyping platform as the problematic issue, rather than a flawed experimental 
design. We believe that this study can become a learning path for improving experimental design, if we 
converse openly among the scientific community. 

A "lack of quality control" provides a clue that we are observing a symptom of a broader prob- 
lem; indeed, a whole class of post-experiment statistical methods has emerged to address confounding. In 
GWAS, for instance, genotype measurements may be filtered based on call rate for departure from Hardy- 
Weinberg equilibrium or for having a low minor allele frequency. These methods that remove outliers 
resulting from flawed experimental design represent a palliative, not a cure. In our view, the GWAS re- 
search community has too often accommodated bad experimental design with automated post-experiment 
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cleanup. We infer this in part because sometimes researchers seem unaware that experimental design can 
be a source of confounding, resulting in suspect scientific conclusions. While good statistical practice 
examines every outlier, experimental designs for large-scale hypothesis testing have produced so many 
outliers that the field has made it standard practice to automate discarding outlying data. In our anal- 
yses of GWAS with consistent protocols for sample collection so that avoidable sources of variability 
are removed, with proper design of the genotyping experiment over plates ("blocking what you can and 
randomizing what you cannot" [Box and others, 1978]), we are learning that post-experiment automated 
filters are unneeded. Rather, one can examine each signal of association in turn for biological relevance. 

3. Reconsidering "experiments" 

A first step in increasing clarity in biostatistics experimental design acknowledges the broader continuum 
in which biostatistical research resides. In the scientific method, an experiment arbitrates between com- 
peting models or hypotheses. Unlike with mice and fruit flies, experiments involving humans cannot take 
the liberty of cloning 2 copies of a person, disabling a gene on one, keeping all other factors constant, and 
observing the effects. Thus, most studies on humans are observational because it is not practical to fit the 
system under study into a laboratory setting. In observational studies, therefore, great attention must be 
given to identifying and accounting for confounding factors. 

Even though we often refer to GWAS as "experiments," they are observational studies or quasi- 
experiments. In this realm, there are 2 useful types of studies. One seeks signals (associations among 
variables) as inputs for generating hypotheses. The other starts with a hypothesis — or multiple candidate 
hypotheses (Chamberlin, 1890) — and designs one or more studies that pursue systematic hypothesis falsi- 
fication. Both kinds of studies using observational data aim to generate and refine cohesive causal theories 
consistent with, or not contradicted by, numerous observations of the empirical world. In a paper advo- 
cating applying Karl Popper's philosophy of science to epidemiology, Buck (1975) acknowledged that 
some scholars relegated the epidemiologist to the role of data gatherer through association testing. But 
she invoked the example of John Snow, a "father" of modern epidemiology, who identified the "causal" 
mechanism of a cholera outbreak in London, as part of the strong deductive-reasoning heritage of the 
field. Buck characterized many epidemiological studies as "quasi-experiments" but urged epidemiolo- 
gists to form hypotheses prior to designing collection of observational data and, moreover, to embrace 
the "exciting process of deduction" since systematic refutation of hypotheses deduced from theory can 
inform and alter causal understanding. 

The obvious benefits of large-scale observational data sets lie in increasing the statistical power and the 
chances that the findings will be generalizable beyond the sample. In a quasi-experimental context, how- 
ever, we must consider how to design collection and analysis of observational data to yield the strongest 
possible findings elucidating mechanisms in studied biological systems. What processes can we establish 
to increase recognition of noncontrolled variables in experiments? To add nuance to these challenges, we 
reflect on current practices in data collection and experimental execution. 

4. Reviewing data collection and experimental execution in the GWAS context 

While an observational study should perhaps be initiated with hypotheses in mind, in current practice, col- 
lecting large amounts of biological data on populations of humans or other organisms often commences 
with an assumption that many hypotheses will be generated "after" data collection. These data sets can 
yield fruitful exploration, yet we must recognize the risks. When a data analyst using a data set collected 
by others (for possibly unrelated research questions) forms a hypothesis, he may be unaware of sources 
of variability in the data. For efficiency's sake, there may be divisions of labor in data collection pro- 
cesses arising from geographic dispersion or quantity of work. From our own experiences, we know that 
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variations in data collection protocols are common, and locating documentation is difficult. Even analysts 
who make every effort to understand sources of variability may find little with which to work. 

A widespread practice of "borrowing controls" from separate data collection efforts, often for 
economic reasons, can prove risky, as demonstrated by the Sebastiani and others (2010, 2011) paper 
and subsequent retraction. An important way to retain the large economic savings and statistical power 
benefits of combining data sets from disparate studies is to block randomize by phenotype and strive 
for equal proportions of cases and controls in each study. Problems with past studies may be mitigated 
with filters, imputation, and resequencing, and certainly many valid findings have emerged despite some 
experimental flaws. But since appropriate design incurs little extra cost, we can hardly justify making 
similar mistakes going forward, especially as we enter an era dominated by large-scale experimentation 
with next-generation sequencing. 

Divisions of labor arising from specializing in parts of experimental execution may introduce addi- 
tional separations between hypotheses and experimental design. The trio experiment mentioned above 
represents an instance of disconnections between laboratory scientists who plated the biological sam- 
ples and the research analysts and their scientific questions. Some may view divisions of labor among 
data collectors, laboratory scientists, and research analysts as a necessary evil. But when we consider the 
high costs of collecting large amounts of biological data and the higher costs of implementing new or 
changed health care policies, it seems an inordinate and unacceptable risk "not" to involve a statistician in 
designing both sample collection and protocols for performing subsequent measurements. The growing 
use of post-experiment "quality control" efforts may in large part arise from failures to obtain substantive 
statistician participation early in experimental designs. We advocate for ongoing conversations about what 
approaches offer the most scientifically rigorous ways to probe the data made available in large observa- 
tional data sets and what processes can ensure coherent experimental design and execution across roles, 
given gaps in time, space, and research purposes. To explore further, we must acknowledge that associa- 
tion studies using large-scale observational data sets use "hypotheses" in ways different from many other 
experimental undertakings. 

5. "Hypotheses" in the association study context 

Collecting samples without specifying preliminary hypotheses leads to a frame of mind in which genetic 
assays are run without an experimental design that accounts explicitly for the hypotheses to be tested — in 
other words, not taking steps to randomize and so prevent biases in data. As with many large biological 
data collection efforts, GWAS are premised on an overarching hypothesis that a disease has a genetic 
cause, rather than many more granular hypotheses designed to challenge a causal theory of how a disease 
manifests. A GWAS might test a million null hypotheses that case-control status and a given genomic SNP 
are statistically independent over the population sample. The experiment to falsify these null hypotheses 
is an automated search for indications of association between genetic variants and disease status. 

In the traditional scientific method, a hypothesis is a proposed explanation for a phenomenon, but 
scientists use the hypothesis to deduce additional predicted effects which, if observed, can corroborate 
the hypothesis but not prove it and, if not observed as predicted, can falsify the hypothesis. A single 
observation can falsify a traditional scientific hypothesis. 

Such a hypothesis might describe a force or property common to all members of a population. A 
hypothesis of this sort must make universal claims in order to produce repeatable experiments; then a 
hypothesis tested by experimenting on a member of the population could be refuted for the entire popu- 
lation. If we abstract a hypothesis-forming process this way — "I have observed property X in a number 
of instances of A; I hypothesize that all other instances of A will have property X" — then the under- 
lying premise is "All A's are effectively identical." If all A's are effectively identical, then we should 
accept that a corroborated or refuted hypothesis can inform us about the universe of A's. If A's are billiard 
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balls, and we are looking at properties of how they transfer momentum on a billiard table, we would 
expect all instances of these A's to have the same properties. If, instead, A's are people, and we seek 
to identify universal rules about behaviors or effects of environment on disease status applicable to all 
people, the premise of uniformity can lead us astray (see Meehl, 1990, pp. 200-201, for his discussion 
of ceteris paribus). Once we categorize nonidentical objects together and make cause-effect statements 
about them, we can speak about population parameters — but in this context a single observation cannot 
falsify a statistical hypothesis about a population of unlike objects. 

What Meehl (1967) called the "strong" form of null hypothesis testing, compatible with Popper's 
(2002) approach of falsifying a scientific theory, is in line with Fisher's original formulation of a test of 
significance (Fisher, 1955, 1956; Gigerenzer, 2004; Gill, 1999). Fisher's test specified a null hypothesis 
which would be falsified if data analysis did not yield a significance level determined by the test statistic 
appropriate to the assumed or known data distribution. Falsifying hypotheses derived from theory can 
advance causal understanding in science, and hypothesis testing to "confirm" theories is statistically and 
scientifically unsound. 

Notably, biostatistics design methods have roots in Fisher's agricultural experiments (Fienberg and 
Tanur, 1966), but some basic principles underlying that work have been overlooked as the methods 
have migrated to other research realms. Fisher (1956) opposed misuses of null hypothesis testing, and 
Gigerenzer (2004, p. 589) characterized Fisher's view of null hypothesis testing as "the most primitive 
type of statistical analyses . . . used only for problems about which we have no or very little knowl- 
edge" (italics in original). It is thus appropriate to apply this method when one first examines a disease 
and has no idea where to look for genetic factors that may be causative, as we do in GWAS. Neverthe- 
less, we can recognize that, without a theoretical context, a hypothesis (whether corroborated or discon- 
firmed) can do little to enhance understanding of cause and effect. Moreover, without a clear falsifiable 
stance — one that has implications for the theory — associations do not necessarily contribute deeply to 
science. 

6. Multiple testing revisited 

Historically, epidemiology has focused on minimizing Type II error (missing a relationship in the data), 
often ignoring multiple testing considerations, while traditional statistical study has focused on minimiz- 
ing Type I error (incorrectly attributing a relationship in data better explained by random chance). When 
traditional epidemiology met the field of GWAS, a flurry of papers reported findings which eventually 
became viewed as nonreplicable. As a result, genome-wide thresholds of significance were instituted, and 
replication in an independent sample became the gold standard for GWAS publication. It would appear 
the field learned from its mistakes. Consider, though, that when genome-wide significance is not reached, 
researchers commonly take forward, say, 20-40 nominally significant signals from a GWAS and then run 
association tests for those signals in a second study, concluding that all the signals with a /j-value <0.05 
have replicated (no Bonferroni adjustment). Frequently 1 or 2 associations replicate — which is also the 
number expected by random chance. Then to further "confirm" the signal, the "replicated" signals from the 
20-40 signals tested in the second study are combined with the previous study data to compute /7-values 
considered genome-wide significant. This method has been propagated in publications, leading us to won- 
der if standard practice could become to publish random signals and tell a plausible biological story about 
the findings. 

Few large-scale GWAS have used high-throughput sequencing. Yet there are dramatic success stories 
of researchers locating the causal variant for a rare disease by sequencing a few affected and unaffected 
members of a family. What distinguished these is the explicit search for causative variants, working from 
the hypothesis that the causative variant lies in a protein-coding region and must be present in all affected 
members of a family and absent in the others. Because such a small number of variants satisfy these and 
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other filtering criteria, a putative cause can be identified, and predicted effects can be tested, by seeing, 
for instance, if the mutation causes protein-binding disruption in a subsequent experiment. 

7. Building on association studies findings 

GWAS research relies on the premise that, if there is a causal link between a genetic variant and a disease, 
we will see significant association. As noted, however, significant associations do not imply causation. 
If we can perturb many different parts of a biological system and see a disease, what specifically causes 
the disease? With association studies, we metaphorically poke different parts of a biological web to assert 
that a particular strand might cause the disease status. Perhaps, a more aggregated causal explanation is 
needed, such as if the sum of various components perturb the system's equilibrium by a certain degree — if 
many strands of the web are stretched beyond a certain extent — the effect is to push the system into a 
disease state. We also expect that future biostatistics research will require extensive longitudinal pathway 
measures to elucidate causal relationships in aggregated biological systems. 

Because every biological system exists in a varying environment to which it responds, and because 
delays between sub-part interactions and their effects in observable macro variables can create nonlin- 
earities or tipping points, the interrelationships in biological systems can appear intractable. Although 
genome research has sometimes been cast as capable of providing unequivocal cause-effect explanations, 
an association between biological system "case" status and a tiny subset of a biological system such as 
an SNP may not be as informative as society would hope. Pathway analysis approaches show promise, 
and we must still guard against publishing more unfalsifiable correlations without causal follow-up. We 
can also discuss openly analytical approaches that render association more meaningful in the context of 
interdependent biological systems. In an era of molecular research, we may reconsider the usefulness of 
the concept of named diseases, classifications that may have been formed long ago based on symptoms of 
like appearances that may have unlike causes. 



8. Falsifying hypotheses and valuing mistakes 

For over a century, periodically authors call for renewing the scientific method and warn against seeking 
confirmation of an explanation of reality, rather than trying to falsify it (see Chamberlin, 1890; Ioannidis, 
2005; Piatt, 1964, for notable examples of the genre). Calls for clarity and rigor in experimental design 
remain as relevant as ever, especially since public and private entities invest increasing resources in large- 
scale biological research. While our purpose here is to raise questions rather than presume that we know 
answers, we nevertheless suggest some actions that can strengthen our field. 

We can undertake to learn methods that help us recognize what we do not know. Scientists take as given 
the methodologies described in their particular literature and practiced by leaders in their field. The critical 
mass of methods appearing in the literature matters because, as Kuhn (1962) noted, an accumulation of 
apparent consensus can discourage diffusion of other valuable approaches because they are viewed as 
inconsistent with or simply different from the dominant paradigm. A biologist trained in biology alone 
may find herself unprepared for large-scale experimentation unless she also undertakes systematic study 
of statistical methods. Let us talk openly about interdisciplinary collaborative approaches that lead us to 
unlearn previously unquestioned assumptions as well as create new knowledge. 

We can refrain from overstating in publication "what science tells us." Both contemporary statisti- 
cal methods and the traditional scientific method agree that nothing can be accepted as certain. We can 
articulate clearly limits to generalizing findings from sampled populations. When conducting research 
to inform public policies, we can advocate for complementary causal mechanism studies even as we rec- 
ognize that sometimes policy and health practice decisions must be made urgently in light of associations. 
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A key may lie in editorial boards and reviewers for journals such as this since publication is a major gate- 
way to scientific reputation. Almost all association-based studies assert that the identified signal makes 
sense in terms of either previously published associations or causal studies of biological mechanisms re- 
lated to the signal. If journals were to insist that association studies also suggest possible experiments that 
could falsify a putative theory of causation based on association, the quality and durability of association 
studies could increase. 

Mainstream publications also provide a point of leverage. Since many people do not understand that 
association is not equal to causation, we can direct public attention to findings and theories that have with- 
stood extensive attempts at falsification. A survey of medical findings reported in newsprint (Bartlett and 
others, 2002) found that observational studies were much more likely than randomized clinical trials to be 
reported in the press, despite that randomized studies are proportionally overrepresented in research press 
releases and are less subject to biases common in observational studies. Aware that media favor report- 
ing observational studies, we can diligently report findings using the recommendations of the STROBE 
guidelines (von Elm and others, 2007). 

Management scientist and operations researcher Ackoff (2006) said he was often asked regarding 
his advocacy of systems thinking, "If this way of thinking is as good as you say it is, why don't more 
organizations use it?" Ackoff answered that the challenge exists in adopting any transforming idea: we 
are afraid of making mistakes, errors of commission. In a world that punishes mistakes and regularly 
disregards often more devastating errors of omission, publishing correlations may provide a temptingly 
reassuring path. Correlations are not easily falsified; nonreproducibility may be explained in many ways, 
including by population differences. Professionally, potential reward and drastically reduced risk can re- 
sult from publishing correlations. Applying well -published but perhaps flawed methods of experimental 
design also presents a path of safety; if one uses a method published in Nature or Science, reviewers 
may be less inclined to question one's approach. Consistent with Kuhn, researchers (Young and others, 
2008) suggest that a pattern of "following the leader" is further amplified by path dependency resulting 
when "early decisions by a few influential individuals as to the importance of an area of investigation 
consolidate . . . the trajectory" of research (p. 1420). 

The problem with avoiding mistakes, Ackoff noted, is that we learn, that is, update our mental models 
of reality, only "through" mistakes, through falsification, not from what confirms our expectations. As 
long as our culture sets professional impact and personal security at odds with the scientific method's 
requirement of falsifying hypotheses, its use will be hindered. The scientific progress achieved despite 
the underlying cultural conflict is amazing — and yet imagine what may emerge if this obstacle were re- 
moved. A real lever for scientific progress may lie in how we treat our children, one another, and foremost 
ourselves with respect to our mistakes. 

Persistent inference of multiple falsifiable hypotheses challenging existing theories provides one of the 
fastest ways to probe the unknown. Popper wrote that bold or daring scientific conjectures set apart great 
science from the rest, and he called a conjecture "daring if and only if it takes a great risk of being false — 
if matters could be otherwise, and seem at the time to be otherwise" (Popper, 1985, pp. 118-119). Bold 
claims that are just as boldly disproved by (even our own) scientific experiments merit scientific praise 
and publication — and this is what society wants of science, to probe and provide causal explanations 
on issues of widespread concern. We can contribute by rethinking assumptions and practices underlying 
statistical analyses of observational data sets to increase the rigor of experimental design, execution, and 
interpretation in efforts to falsify, systematically, our best explanations of what we think we know. 
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